Photo-realistic visual speech synthesis based on AAM features and an articulatory DBN model with constrained asynchrony

Authors

  • Peng Wu
  • Dongmei Jiang
  • He Zhang
  • Hichem Sahli
Abstract

This paper presents a photo-realistic visual speech synthesis method based on an audio-visual articulatory dynamic Bayesian network model (AF_AVDBN), in which the maximum asynchronies between articulatory features, such as lips, tongue and glottis/velum, can be controlled. Perceptual linear prediction (PLP) features from the audio speech and active appearance model (AAM) features from mouth images of the visual speech are adopted to train the AF_AVDBN model on continuous speech. An EM-based optimal visual feature learning algorithm is derived given the input auditory speech and the trained AF_AVDBN parameters. Finally, photo-realistic mouth images are synthesized from the learned AAM features. In the experiments, mouth animations are synthesized for 30 connected-digit audio speech sentences. Objective evaluation results show that the visual features learned with AF_AVDBN track the real parameters much more closely than those from the audio-visual state-synchronous DBN model (SS_DBN, the DBN implementation of the multi-stream Hidden Markov Model) and from the state-asynchronous DBN model (SA_DBN). Subjective evaluation results show that by considering the asynchronies between articulatory features in the AF_AVDBN (and between audio and visual states in the SA_DBN), good synchronization between the audio speech and the mouth animations is obtained. Moreover, since the AF_AVDBN captures the dynamic movements of the articulatory features and models the pronunciation process more precisely, the accuracy of its mouth animations is much higher than that of the SA_DBN and SS_DBN models; very accurate, clear, and natural mouth animations can be obtained with the AF_AVDBN model and AAM features.
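The last step of the pipeline described above amounts to decoding each learned AAM feature vector back into a mouth image. The sketch below is not the authors' code; the function and variable names, and the omission of the shape-normalising warp, are illustrative assumptions about a generic linear AAM decoder:

```python
# Minimal sketch: reconstructing a mouth frame from AAM shape and appearance
# parameters with a plain linear model (assumed names, simplified AAM).
import numpy as np

def reconstruct_mouth(shape_params, app_params,
                      mean_shape, shape_basis,
                      mean_app, app_basis, roi_size):
    """Linear AAM decoding: x = x_mean + P_s b_s,  g = g_mean + P_a b_a."""
    # Mouth landmark coordinates (flattened x/y pairs).
    shape = mean_shape + shape_basis @ shape_params
    # Shape-normalised appearance (pixel values sampled inside the mean shape).
    appearance = mean_app + app_basis @ app_params
    # A full AAM would now warp `appearance` from the mean shape to `shape`
    # (e.g. with a piecewise-affine warp); here the vector is simply reshaped
    # into an image for illustration.
    return shape.reshape(-1, 2), appearance.reshape(roi_size)

# Hypothetical per-frame usage over a learned AAM feature sequence, where each
# feature vector concatenates k_s shape parameters and the appearance parameters:
# for b in learned_aam_features:
#     landmarks, mouth_img = reconstruct_mouth(b[:k_s], b[k_s:], ...)
```

In the actual AAM formulation the reconstructed appearance is warped from the mean shape to the reconstructed landmark shape before being composited into the output frame; the sketch leaves that warp out to stay compact.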


Similar articles

Photo-Realistic Mouth Animation Based on an Asynchronous Articulatory DBN Model for Continuous Speech

This paper proposes a continuous speech-driven photo-realistic visual speech synthesis approach based on an articulatory dynamic Bayesian network model (AF_AVDBN) with constrained asynchrony. In the training of the AF_AVDBN model, perceptual linear prediction (PLP) features and YUV features are extracted as the acoustic and visual features, respectively. Given an input speech and the trained AF_...


Audio-Visual Speech Processing System for Polish with Dynamic Bayesian Network Models

In this paper we describe a speech processing system for Polish which utilizes both acoustic and visual features and is based on Dynamic Bayesian Network (DBN) models. The visual modality extracts information from the speaker's lip movements and is based alternatively on raw pixels and discrete cosine transform (DCT) or Active Appearance Model (AAM) features. The acoustic modality is enhanced by using two pa...
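As a rough illustration of the DCT-based visual features mentioned in this related work (not that system's actual code; the crop size and number of retained coefficients are assumptions), low-frequency 2D-DCT coefficients of a cropped lip region can serve as a compact visual feature vector:

```python
# Minimal sketch: 2D-DCT visual features from a grayscale lip crop.
import numpy as np
from scipy.fftpack import dct

def dct_lip_features(lip_roi_gray, n_coeffs=8):
    """lip_roi_gray: 2-D grayscale lip crop. Returns n_coeffs**2 features."""
    # 2-D DCT-II: apply the 1-D transform along rows, then along columns.
    coeffs = dct(dct(lip_roi_gray.astype(float), norm='ortho', axis=0),
                 norm='ortho', axis=1)
    # Keep the top-left (low-frequency) block, which carries most of the
    # lip shape/appearance energy, and flatten it into a feature vector.
    return coeffs[:n_coeffs, :n_coeffs].ravel()
```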


Speech Attribute Detection Using Deep Learning

In this work we present alternative models for attribute speech feature extraction based on two state-of-the-art deep neural networks: convolutional neural networks (CNN) and a feed-forward neural network pretrained with a stack of restricted Boltzmann machines (DBN-DNN). These attribute detectors are trained using a data-driven approach across all languages in the OGI-TS multi-language te...


An Analysis-by-Synthesis Approach to Vocal Tract Modeling for Robust Speech Recognition Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Electrical and Computer Engineering

In this thesis we present a novel approach to speech recognition that incorporates knowledge of the speech production process. The major contribution is the development of a speech recognition system that is motivated by the physical generative process of speech, rather than the purely statistical approach that has been the basis for virtually all current recognizers. We follow an analysis-by-s...


Articulatory feature-based pronunciation modeling

Spoken language, especially conversational speech, is characterized by great variability in word pronunciation, including many variants that differ grossly from dictionary prototypes. This is one factor in the poor performance of automatic speech recognizers on conversational speech, and it has been very difficult to mitigate in traditional phone-based approaches to speech recognition. An altern...



Journal title:

Volume   Issue

Pages  -

Publication date: 2011